Towards automatic home-based sleep apnea estimation using deep learning

Apnea and hypopnea are common sleep disorders characterized by the obstruction of the airways. Polysomnography (PSG) is a sleep study typically used to compute the Apnea-Hypopnea Index (AHI), the number of times a person has apnea or certain types of hypopnea per hour of sleep, and diagnose the severity of the sleep disorder. Early detection and treatment of apnea can significantly reduce morbidity and mortality. However, long-term PSG monitoring is unfeasible as it is costly and uncomfortable for patients. To address these issues, we propose a method, named DRIVEN, to estimate AHI at home from wearable devices and detect when apnea, hypopnea, and periods of wakefulness occur throughout the night. The method can therefore assist physicians in diagnosing the severity of apneas. Patients can wear a single sensor or a combination of sensors that can be easily measured at home: abdominal movement, thoracic movement, or pulse oximetry. For example, using only two sensors, DRIVEN correctly classifies 72.4% of all test patients into one of the four AHI classes, with 99.3% either correctly classified or placed one class away from the true one. This is a reasonable trade-off between the model’s performance and the patient’s comfort. We use publicly available data from three large sleep studies with a total of 14,370 recordings. DRIVEN consists of a combination of deep convolutional neural networks and a light-gradient-boost machine for classification. It can be implemented for automatic estimation of AHI in unsupervised long-term home monitoring systems, reducing costs to healthcare systems and improving patient care.

Apnea and hypopnea are common sleep-related breathing disorders that affect 6% to 17% of adults, reaching up to 49% in older populations 1 .They are usually marked by pauses in breathing or shallow breathing, respectively.This can happen, for example, when the soft tissue at the back of the throat collapses and closes at night 2 .Because blockage in the airways slows breathing, less oxygen is sent from the lungs to the heart and body.The CO 2 level in the blood then increases due to impaired breathing, leading to episodes of sudden awakening or choking during sleep 3 .During these events, nasal airflow can cease or reduce for a short time, and the brain and body will fight to keep breathing.Sleep apnea and hypopnea have been associated with an increased risk of heart disease, diabetes, chronic kidney disease, stroke, depression, and cognitive impairment 2 .As age and the likelihood of having sleep apnea are positively correlated 4 , an aging population is expected to add more pressure on healthcare systems.
Typically, the diagnosis and detection of sleep apnea and hypopnea are based on polysomnography (PSG) tests conducted in a sleep facility or at home 5 .PSG requires the overnight recording and monitoring of several physiological signals, including electroencephalogram (EEG), electrocardiogram (ECG), electrooculogram, chin muscle activity, leg movements, respiratory effort, nasal airflow, and oxygen saturation (SpO 2 ).This information is then used to compute the Apnea-Hypopnea Index (AHI), the number of times a person has apnea or certain types of hypopnea per hour of sleep, as defined by the American Academy of Sleep Medicine (AASM) 6 .
There are many limitations with PSG tests: they are time-consuming, expensive, and inconvenient 7 .As a result, it is expected that more than 85% of people with sleep apnea or hypopnea are not diagnosed 4 .Hence, the development of portable, easy-to-use, reliable, and affordable AHI estimation tools is crucial to improve patient care.Currently, there is only one method that estimates AHI 8 , using an oximetry signal.However, this method does not estimate when a patient is sleeping, which is essential to compute AHI.Instead, it uses the information of when patients are asleep from the labels provided in the dataset (ground truth).As we will show later, oximetry signals on their own struggle to identify when patients are asleep, even when coupled with heart rate signals.This limitation can lead to poor AHI estimation.Finally, the available method does not detect events or segment oximetry signals, i.e., it does not identify the exact periods of time when apneas and hypopneas occur.
When focusing only on short time-series segments, there are many classification tools to decide whether an apnea event is present or not (see, for example, the comprehensive reviews in refs.9-11).Most results are limited to obstructive sleep apnea (OSA), the most severe and common type of sleep apnea 12 .Such detection tools for OSA classification have been based on manual feature extraction, from time and/or frequency domains of one or more physiological signals.These algorithms require expert knowledge for the design and selection of features to be extracted, which may still not include the most relevant features (information) for classification from data 13 .The procedure can be hard, time consuming, and often subject to bias or lack of generalizability 14,15 .Deep learning, on the other hand, has shifted data modeling away from "expert-driven" feature engineering toward "datadriven" feature extraction 8,11 and is now widespread across many other biomedical applications 16,17 .
Based on deep learning, this paper proposes a data-driven method, named DRIVEN, to estimate AHI and segment physiological signals.We focus on devices that can be implemented in a home care environment and were available in the considered datasets, including abdominal and thoracic movement, oxygen saturation, and R-to-R interval data.Our study employs an extensive multicenter database containing 14,370 recordings of 10,752 patients belonging to multiple ethnicities and with an average age of 68.7 years.To estimate AHI, DRIVEN detects apnea and hypopnea events, and determines when patients are sleeping or awake.To our knowledge, it is the first standalone algorithm to estimate AHI.Using abdominal movement and oximetry sensors, DRIVEN correctly classifies 72.4% of all test patients into one of the four AHI classes, with 99.3% either correctly classified or placed one class away from the true one, a reasonable trade-off between the model's performance and patient's comfort.Cross-validation and out-ofthe-domain testing show that DRIVEN is highly generalizable, achieving high performance on large external datasets not included during the training process.Given the high performance and small computational cost of DRIVEN, we expect that it could be implemented as part of a home AHI estimator system.

Data description
Three different datasets were considered in this work: the Multi-Ethnic Study of Atherosclerosis (MESA) 18 , the Men Study of Osteoporotic Fracture (MrOS) 19 , and the Sleep Heart Health Research (SHHS) 20 .These datasets are publicly available from the National Sleep Research Resource for Sleep-Related Studies (NSRR) 21 .In total, 14,370 PSG recordings were considered in our study.Each PSG recording consists of data collected from several physiological sensors during a patient's overnight sleep (lasting 8.78 h on average).The PSG recordings were collected during typical PSG evaluations in healthcare facilities.
This study focuses only on channels that can be easily implemented in a home environment and are accessible in all three datasets: pulse oximetry (SpO 2 ), thoracic movement, and abdominal movement.Data collected from various healthcare facilities had different sampling frequencies.These were normalized by resampling all channels to the maximum sampling frequency (64 Hz).Hence, all signals per patient were standardized to the same length, which simplified the training of the neural networks.PSG recordings were excluded from the study if they met at least one of the following criteria: (1) poor-quality PSG recordings given by the presence of "unsure" or "noise" labels in more than a third of the total sleep time or the repetition of only one value through the whole signal; and (2) the absence of one or more channels from the five initial channels selected for the study.After applying the exclusion criteria, a total of 10,643 sleep recordings were obtained to develop and evaluate the proposed method.Table 1 summarizes the databases.
The data was labeled following the definition of AHI (ahi_a0h3a) from the American Academy of Sleep Medicine 6 .An apnea is defined as a nasal flow reduction of more than 90%, while an hypopnea is defined as a nasal flow reduction of more than 30%, both for at least 10 s.There are several types of hypopneas, in which only the following two count towards AHI: hypopneas with a reduction of oxygen desaturation equal to 3% or more within 45 s (also labeled in the paper as hypopnea type 1) and hypopneas with an arousal within 5 s after the end of the event (hypopnea type 2) 22 .All other hypopneas (hypopnea type 3) are not considered in the calculation of AHI.In this paper, an AHI event refers to any of these events: apnea, hypopnea type 1, and hypopnea type 2. According to the AASM guidelines, patients are categorized according to their AHI: healthy (less than 5 events/h), mild (5-15 events/h), moderate (15-30 events/h), and severe (more than 30 events/h).

DRIVEN: deep-learning model for the detection of AHI events
We developed a hybrid deep-learning model named DRIVEN to (1) classify "normal" or AHI events, (apneas, hypopneas type 1, and hypopneas type 2) from data, (2) estimate the total time of sleep, and (3) estimate the AHI.In the classification step, the inputs of DRIVEN are short segments of timeseries data recorded during sleep.We explored window lengths of 10, 30, and 80 s, and selected 30 s windows as they outperformed the others at estimating AHI (see Methods for details).We focused on the following physiological signals: SpO 2 , thoracic movement, and abdominal movement.DRIVEN then learns whether a given input corresponds to a normal event or an AHI event.Figure 1 illustrates DRIVEN's pipeline.For each available physiological signal, a distinct deep convolutional neural network (CNN) is trained to extract global features from the raw 1-dimensional signals.The extracted features from all CNNs are then combined on a light gradientboost machine (LightGBM) 23 to perform the classification between normal and AHI events.This approach has proven to generalize better than just using one CNN to input all the channels (see subsection "End-to-end CNN" in Methods).Finally, for each patient, DRIVEN counts all detected AHI events to estimate the AHI.We evaluate the performance of different sensors, window sizes, and deep-learning models.See Methods for details on data pre-processing and model training, validation, and testing.
DRIVEN was trained and assessed on raw physiological signals.This avoids creating human-selected features, which can sometimes lead to bias.The data were separated by center to avoid data leakage and guarantee good performance on external datasets.SHHS was used to train and validate the model (both CNNs and LightGBM) while MESA and MROS were used as test sets to independently evaluate DRIVEN on unseen out-of-domain data from other sources.For each patient, sleep intervals were sampled every 30 s with a 15 s overlap and labeled as normal or AHI event, yielding a total of more than 15 million samples.The training and validation data (SHHS database) were originally imbalanced with a ratio of 6.4 normal events per AHI event.To overcome the bias and inaccuracy associated with classification using imbalanced data between two classes, the number of normal events was under sampled.Balancing the number of positive and negative samples is a common strategy followed in the literature 24 .

Performance of DRIVEN on sleep/awake detection
The AHI score is defined as the quotient between the total number of AHI events recorded overnight and the total time that patient slept (see subsection "AHI and severity class estimation from positive events" in Methods).Hence, to get an accurate estimation of this number, DRIVEN also needs to estimate the time the patient slept through the night.We trained the same architecture proposed in Fig. 1, using window sizes of 30 s, with the annotated labels from the dataset for awake or sleep.The training, validation, and testing methodology were similar as before.For the combination of abdominal and SpO 2 sensors, we obtained an area under the receiver operating characteristic curve (AUROC) of 0.87 and an area under the precision-recall curve (AUPRC) of 0.90 in the test dataset.We also observe that adding RRI to any combination of sensors has little effect on the performance of the algorithm and that pulse oximetry (SpO 2 ) alone is a poor sleep state predictor.The complete results of the sleep/wake classification are reported in Supplementary Fig. 1 and Supplementary Table 1.

Performance of DRIVEN on classification of AHI events
We now evaluate the performance of DRIVEN in the classification task between normal and AHI events.Results for the complete test set (MESA and MROS databases) are shown in Fig. 2a,b using either a single channel or a combinations of channels (when the sleep/awake states are also predicted by DRIVEN).AUROC and AUPRC show that models using thoracic and abdominal movement sensors each feature very similar performance, even when these two sensors are used in combination.This indicates very high mutual information between these sensors.On the other hand, adding SpO 2 significantly improves the performance, indicating additional information carried by this signal.Note that the model using the SpO 2 channel alone struggles to classify AHI events due to its poor performance to estimate sleep/awake segments (Supplementary Fig. 1 shows a relatively low AUROC and AUPR of 0.72 and 0.76, respectively, on the sleep/awake detection when using only SpO 2 ). Figure 2c shows performance metrics as a function of the threshold when using two sensors: abdominal movement and SpO 2 .Both accuracy and F1-Event-Classification curves are relatively flat for a wide range of thresholds showing that the model is robust to the choice of threshold (which was decided on the validation results).Note that estimating hypopneas type 2 requires detecting short-term arousals, which are hard to estimate from the three available sensors.In contrast, apneas and hypopneas type 1 are much easier to detect (Supplementary Figs. 2 and 11).

Performance of DRIVEN on AHI estimation
This section evaluates the performance of DRIVEN at estimating the AHI for each patient on the test set.It consists of first estimating the total sleep time and then counting all detected AHI events throughout the night for each patient.Using both abdominal movement and SpO 2 sensors, Fig. 3a shows the real versus predicted AHI while Fig. 3b contains the confusion matrix of the four severity categories.In total, 72.4% of patients are correctly classified and 99.3% were either correctly classified or were placed one class away from the true one.No healthy patient was classified as severe.Among the 5355 test patients, only one severe individual was classified as healthy.This occurred because the patient had a short sleep duration of 3.5 hours with frequent apneic bursts.DRIVEN exhibits consistent performance across all AHI severity levels, effectively identifying sleep-disordered breathing events in all patient groups.The model provides a good trade-  off between accuracy and patient's sleep comfort, when compared to a full polysomnography.Results for other combination of sensors are reported in Supplementary Figs.3-8.
To investigate whether some physiological signals may contribute more than others in estimating AHI, we calculated F1 scores for different combinations of sensors (Table 2; further details can be found in Supplementary Figs.3-8).Similar to the classification results above, thoracic and abdominal movement sensors each have similar performance, while adding SpO 2 significantly improves the performance.The combination of abdominal movement and SpO 2 sensors results in one of the highest performance and using only two sensors.We also compared the AHI classification F1 scores when AHI is estimated (1) without sleep information, (2) with ground truth on the sleep/awake states, and (3) with the predicted sleep/awake states (Supplementary Table 2).Using abdominal and SpO 2 sensors, the F1 performance drops around 4% from the ground-truth sleeping information to the predicted sleep.For SpO 2 alone, however, the drop is considerably larger at 12%, as this sensor alone cannot estimate sleep states as well.
Automatic detection of AHI events: segmentation of whole sleep studies Using DRIVEN's classification of AHI events, we can analyze whole sleep studies to identify apnea and certain hypopneas (types 1 and 2).Additionally, with DRIVEN's sleep state detection, we can also identify when the patient is asleep and when the patient is awake.For each patient, recordings were sampled every 15 s with 30 s windows.Each window was labeled as awake or sleep according to the sleep prediction classification.Then, those segments classified as sleep are further categorized into normal or AHI events.Figure 4 illustrates the method on a random patient, revealing the periods of sleep with the highest AHI event episodes, compared with the ground truth, as well as awake segments.

Discussion
This paper introduced a method, DRIVEN, to (1) classify AHI events and sleep states, (2) estimate AHI, and (3) segment whole sleep studies, using a single or a combination of easy to use at home sensors: abdominal movement, thoracic movement, and oxygen saturation.We showed that a deep neural network could be effectively trained using an extensive database to extract relevant features from raw physiological data.The detection was enhanced by combining it with a LightGBM classifier, which proved to generalize better than using a CNN on its own, as in ref. 25.Moreover, DRIVEN's architecture allows for each channel to have its own sampling frequency, and it estimates both AHI events and sleep stages.The best tradeoff between model performance and patient comfort was achieved when combining data from only two sensors (abdominal movement and pulse oximetry), yielding an F1-score of 72% at estimating AHI.Hence, DRIVEN can provide an estimation of AHI and detection and segmentation of whole sleep studies to aid physicians make a final diagnosis.For patients, it can guide them to seek medical attention.
Thoracic and abdominal movement sensors each featured very similar performance, since they both measure breathing.Adding SpO 2 significantly improved the performance, which is expected given the role of SpO 2 in the clinical definition of AHI.However, SpO 2 alone struggled to estimate the total sleep time per patient and, for this reason, it should not be used alone.Hence, the combination of SpO 2 and abdominal movement sensors provided an overall good performance both in detecting AHI events and estimating the total sleep time per patient.Supplementary Table 3 reports the performance of DRIVEN on AHI-event classification using all possible combinations of physiological signals available in the considered datasets.RRI data, extracted from ECGs, provided only a marginal improvement combined with other sensors.On its own, it resulted in a poor performance.Nasal airflow was also not included in the main analysis (Figs. 2 and 3) since it is an uncomfortable sensor to sleep with, and provided similar performance as either thoracic movement or abdominal movement sensors.
The reasons for choosing CNNs to automatically search for features over domain knowledge feature extraction methods are mainly due to the variety of channels to evaluate and the large available datasets.The flexibility of deep-learning feature extraction approach allowed us to explore several choices of physiological channels (including all possible combinations).DRIVEN's architecture allows us to easily and efficiently evaluate new sensors, such as adding a movement sensor if it is available.Several studies have shown that, when the training datasets are large and there is enough computing power, deep learning performs as well or better than other machine-learning algorithms 26 , as demonstrated in different applications ranging from chess games and speech recognition to computer vision 27 .In the particular case of obstructive apnea detection, 28 reported consistently better results using unbiased deep-learning approaches and CNNs for segment classification in PSG studies.
A general limitation, not just of DRIVEN but also any other method without access to EEG data, is detecting hypopneas type 2. This type of hypopnea requires estimating short-term arousal following a drop of at least 30% in nasal airflow, which is particularly challenging to detect from the  three sensors selected in this study (Supplementary Fig. 2). Figure 4 and Supplementary Fig. 10 also show a few false positives and some 30 s segments with a high probability of being AHI events, which in fact are hypopneas type 3 (that are not considered in AHI, Supplementary Fig. 11).Indeed, detecting AHI events (apneas and hypopneas types 1 and 2) is considerably more challenging than simply detecting obstructive sleep apneas (Supplementary Fig. 2), which is the main focus of most available algorithms 28 .A more specific limitation of DRIVEN is its use of 30 s windows, which might treat bursts of multiple short-lived apnea or hypopnea as a single long event.While the smoothing window has merits by effectively filtering noise and capturing longer events, it could lead to underestimating the number of apnea or hypopnea, especially in cases of severe AHI.This effect can be seen in Supplementary Fig. 10, which is a zoomed-in segment of Fig. 4. The smoothing effect caused individual apnea events of bursts to be merged into one event, resulting in a lower AHI estimation.The complex influence of different bursting patterns of apnea events on AHI is also discussed in the literature.For example, 29 demonstrated that very different bursting patterns could lead to the same AHI values.In future studies, we plan to explore new types of AI models, including selective state-space neural networks 30 , to potentially further improve performance and address these limitations.Finally, we investigated classifying sleep or awake from combinations of the three sensors considered in the paper (Supplementary Table 1 and Supplementary Fig. 1).There was a decrease in performance of DRIVEN when using predicted sleep (Fig. 2) instead of the annotated total sleep time per patient obtained from the dataset (Supplementary Fig. 12), as summarized in Supplementary Table 2.In practice, DRIVEN's performance could be further improved by combining these sensors with other easy-towear sensors like smartwatches or sensors under the mattress.This result demonstrates the potential of DRIVEN for an automatic labeling application.Compared to a Polysomnographic Technologist's manual detection and diagnosis of OSA, computer-assisted signal analysis systems can reduce errors related to inter-and intra-operation variability and tiredness induced by the arduous annotating process 31 .Furthermore, most computer-based analyses can be conducted more cost-effectively and quickly 13 .Overall, DRIVEN could reduce the burden on Polysomnographic Technologists and costs to the healthcare system by making feasible the examination of patients through multiple nights from data that can be collected at home by wearable devices.

Data specification
The polysomnography data is available in European Data Format (EDF) 32 , while the profusion files, in XML format, serve as the annotated label files.These files have been annotated for each sleep stage using the Rechtschaffen & Kales (R&K) criteria 33 .Each profusion file includes annotations for every 30-second sample of sleep stages and instances of sleep problems, with information on initiation and duration time, as well as the sensor used to detect the anomaly.Databases are summarized as follows.  .The blue areas represent true events (zero indicates no event and one indicates an AHI event).The output of DRIVEN is illustrated with symbols that represent, for each 30 s window, the probability of the window to be classified as an AHI event.The windows are colored according to their classification, depending on whether they are above or below the determined threshold of 0.79.The black crosses represent the segments that were classified as awake, the green triangles the ones classified as not AHI events, and the red stars are the windows classified as AHI events.The second plot zooms in a 1h segment.Supplementary Fig. 10 increases even further the resolution to a 15 min interval.Supplementary Fig. 11  For the labeling of events, a window is considered AHI event positive if it is satisfies one of the AHI conditions (apnea, hypopnea type 1, and hypopnea type 2).To obtain AHI, the total number of AHI events throughout the night is divided by the total number of hours of sleep.

Data pre-processing and model training
First, data is split by measurement channel (SpO 2 , abdominal movement, thoracic movement, RRI and airflow).For each patient, the initial and final 30 min were removed from the recording as they are considered set up times.Then, for each channel per patient, all signals are standardized and normalized (to speed up training) by calculating the z-scores of 95% of the data points and applying min-max normalization to eliminate the bias of mean and variance of the raw one-dimensional signal 34 .The RRI data was inferred from ECG data by measuring the time difference between heartbeats (from one R peak to the next R peak) using the Pan-Tompkins algorithm 35,36 .Subsequently, each channel's frequency was set to 64Hz, which is the maximum frequency of the measurements, using the cubic spline as the interpolation method 37 ; this method is simple to implement and, at the same time, it does not attenuate higher frequency components of the signal.Finally, the signals were segmented into windows of 30 s and labeled as "normal" and "AHI events".Data windows correspond to the same time instance across all channels (Fig. 1a).
Normalized samples of 30 s from each channel are then fed in parallel into distinct deep CNNs.Each CNN is independently trained on a specific channel to classify the AHI events from the normal windows, and then used for automatic feature extraction (Fig. 1b).The training was performed balancing the number of normal and AHI events fed to the network.For this task, the number of normal samples was randomly downsampled to match the amount of positive samples.Once the networks have been trained, features are extracted and concatenated to integrate information from different physiological biomarkers.This is achieved by using the concatenated features to train a LightGBM classifier for binary classification between normal and positive AHI-event windows (Fig. 1c).
The CNNs were trained and cross-validated with 7,753,500 samples, 1,042,652 positives and 6,710,848 negatives, from 5288 PSG recordings in the SHHS database.We use the EfficientNetV2 architecture, a deep CNN with 479 layers developed by Google in 2021 38 .It is a modified and optimized version of EfficientNet 39 , a popular image classification algorithm that won the ImageNet 2019 competition 40 .The architecture used in this paper has been modified to handle unidimensional data and perform binary classification.Each CNN was independently trained using raw unidimensional physiological data from pulse oximetry, RRI, thoracic movement, abdominal movement, or nasal airflow.Categorical cross-entropy was used as the loss function and ADAM as the optimizer 41 .If the validation loss did not decrease after eight consecutive epochs, the training was terminated.Once the networks were trained, the final two layers were removed and the last global average pooling layer is used to yield 1280 features from each data channel.

Feature classifier
After the neural networks are trained, and features are extracted, the next step is to concatenate the features extracted by all CNNs and use them for training a LightGBM classifier.In contrast to a large number of other wellknown algorithms, such as XGBoost 42 and GBDT 43 , LightGBM employs the classification algorithm for growing trees in a leaf-wise manner rather than in a depth-wise manner.The leaf-wise algorithm can converge significantly faster than the depth-wise growth method, although its growth can be subject to overfitting if the appropriate hyperparameters are not used 23 .We use a random search method within a specified set of parameters to optimize the training and performance of the LightGBM.This allows a fixed number of parameters from a particular distribution to be sampled instead of testing all the values of potential parameters 44 .The output of the feature classifier is the probability that the evaluated segment contains AHI events according to the AHI definition.The training was performed with a balanced dataset.The complete datasets, however, are imbalanced (more negative AHI events than positive ones).Hence, on the validation data, we determined the probability threshold that optimizes the performance metrics of the classifier.This probability threshold was chosen so that the precision and the recall metrics have the same value in the validation dataset (Supplementary Fig. 9a).

Models and channels comparison
We evaluated longer and shorter window sizes for event classification and AHI estimation on a subset of 1000 randomly selected patients in a 3-fold cross-validation experiment, and selected 30 s windows for their higher performance in the validation set.Additionally, we noticed a delay in the response of SpO 2 signal when a respiratory event occurred; thus, we also evaluated the model performance using different delays in the SpO 2 signal (10, 15, 20, and 25 s delays).For AHI-event classification, we evaluated the feature extraction at different layers of the neural network (model called LGBM-1 for the last layer and LGBM-3 for the third last layer), and the performance of neural networks trained on each channel separately.In summary, we evaluated 9 different channels (abdominal, thoracic, nasal airflow, RRI, SpO 2 , SpO 2 +10s delay, SpO 2 +15s delay, SpO 2 +20s delay, and SpO 2 +25s delay), two different models (LGBM applied to the combination of different sensors extracting the information at the third last layer or at the last layer of each EfficientNetV2), and three different window sizes (10s, 30s, and 80s).Note that every time SpO 2 is used, all its delayed signals are also used as separate channels.The summary of these results are reported in Supplementary Table 3 and the complete set of results is in Supplementary Table 4.These tables show that the 10s windows yield consistently lower performance and RRI signals provide negligible performance improvements.Hence, we now focus on the window sizes of 30 s and 80 s, as well as the channels abdominal, thoracic, and SpO 2 , and use the training data as described below.The complete performance results are reported in Supplementary Table 5, which also includes the estimation of AHI.On the validation dataset, these tables show that a larger window size leads to better classification of AHI events, whereas a lower window size results in better AHI estimation.Larger window sizes are less suited for counting the AHI events needed for AHI estimation as multiple AHI events are counted as a single detection.Since most AHI events last 30 s or less, the 30 s window presents a good trade-off between event classification and AHI estimation.Hence, we finally selected the 30 s windows because the objective of this paper is on AHI estimation.The 30 s windows also allow us to do a better whole sleep study segmentation.
End-to-end CNN DRIVEN's architecture consists of a combination of CNNs and LightGBM.Here, we compare its performance with that of a single CNN that inputs all channels and then performs the final classification.For this model, we used the same NN, EfficientNetV2, and changed the input layer to accept a vector for each channel.As in the previous section, we compared the AHI-event classification as well as the AHI classification of patients with the same 1000 randomly selected patients.Supplementary Table 4 shows that the validation results are quite similar, with the end-to-end CNNs having a slightly lower performance compared with the combination of CNNs and LightGBM.However, the performance of the end-to-end CNNs drop significantly in the test datasets, which are also detailed in this spreadsheet.The end-to-end CNNs tend to overfit the training data and the combination of CNNs and LightGBM shows better generalization across datasets.This result has also been observed by others (see, for example, 25 ).  5. All the results and figures presented in the paper are from trial 2, as we only trained and validated with patients from the SHHS database and tested the model's performance to close the domain gap when applying our algorithm to the other two independent databases.

Sleep/awake classification
The calculation of AHI requires the total time the patient has slept, excluding awake times throughout the night.Hence, it is important to discriminate between sleep and awake intervals.The profusion files per patient were already labeled with this information, which we could use to compute the total sleeping time.This information can also be estimated from other sensors like photoplethysmograms (PPG) 45 or accelerometers.Here, to provide a standalone solution, we investigated classifying sleep or awake from combinations of only the available three sensors considered in the paper.We used the same architecture (CNN+LGBM) and model comparison methodologies to estimate if a 30 s window is sleep or awake.Additionally, we investigated adding an extra sensor, RRI, already discarded for AHI-event prediction.The results are presented in Supplementary Table 1 and Supplementary Fig. 1.The performance is high for most combinations of sensors.The exception is SpO 2 alone, which struggles to correctly differentiate between sleep and awake windows.Adding an RRI sensor does not improve the sleep prediction performance either with the best classifiers or alone.This sleep prediction, with a threshold of 0.5 to classify sleep or awake, is then used for the AHI estimation of each patient.

AHI and severity class estimation from positive events
As the data were imbalanced, we determined the final classification threshold as the point of intersection of the precision and the recall curves on the validation data (Supplementary Fig. 9a).Then, for each patient in the validation dataset, we went through the overnight study classifying all 30 s windows sampled every 15 s (hence, consecutive windows overlap by 15 s).Each 30s window was first classified as awake or sleep, and the sleep segments were further classified into positive AHI event or negative by using the previously defined threshold (Fig. 4).Next, we calculated the ratio between the number of positive AHI events over the total sleep segments.With these ratios, one per patient, we fitted a linear regression to the value of AHI of these patients.In this regression, we used weights to account for the imbalance of the number of patients in different AHI classes.For the test set, we used the threshold selected before (Supplementary Fig. 9b), and obtained the AHI event vs sleep windows ratio as before.Finally, we used the previously determined linear regression factors to obtain the AHI of the test patients.Recall that the patient severity class was determined following the rules provided by the American Academy of Sleep Medicine 6 .

Performance metrics
The performance of the models was evaluated based on the following metrics.Let true positive (TP) represent the number of AHI events that are accurately predicted.True negative (TN) represents the number of normal events that are accurately predicted.False negative (FN) represents the number of AHI events incorrectly predicted as normal events, and false positive (FP) represents the number of normal events incorrectly predicted as AHI events.The accuracy of the model is given by (TP + TN)/(TP + TN + FP + FN), indicating the probability of correctly identifying normal and AHI events; the recall is given by TP/(TP + FN), indicating the probability of identifying AHI events; the specificity is given by TN/(TN + FP), indicating the probability of detecting normal events; the precision is given by TP/(TP + FP), indicating the ratio between detected AHI events and all predicted AHI-event cases; and the F1-score 2(precision × recall)/(precision + recall), indicating the harmonic mean of precision and recall.For every method, the receiver operating characteristic curve, and the precisionrecall curve were obtained for each possible classification threshold.The area under both of these curves allows us to compare the global performance of each architecture.Finally, the patient severity classification results were evaluated by assessing the F1 score of each class and then averaging them.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 1 |
Fig. 1 | DRIVEN's pipeline.a Data are separated by channels and segmented into 30 s windows.b For each channel, a distinct trained deep CNN extracts features (outputs) from the raw signal (input).c The extracted features are concatenated and fed to a trained LightGBM that classifies the input data between normal and AHI events (apneas, hypopneas type 1, and hypopneas type 2).

Fig. 2 |
Fig. 2 | Performance of DRIVEN on the classification of AHI events.a Receiveroperator characteristic and (b) precision-recall curves.Note the overlapping curves for thoracic and abdominal sensors.c Threshold-dependent performance metrics of DRIVEN when using two input channels (abdominal movement and SpO 2 ).The performance results are shown for all patients in the test dataset.The accuracy, precision, recall, and F1-Event-Classification are metrics on the classification of individual events.The F1-AHI-Classification measures the F1-Score of predicting the AHI severity category (healthy, mild, moderate, severe) on whole sleep studies.

Fig. 3 |
Fig. 3 | Performance of DRIVEN on the estimation of AHI. a Real versus predicted AHI divided by the four AHI severity groups.b Confusion matrix.The performance is evaluated on the test data considering a threshold of 0.79 and a combination of two signals (abdominal movement and SpO 2 ).

1 .
The MrOS Sleep Study: MrOS is a substudy of the Men Study of Osteoporotic Fractures.Between 2000 and 2002, a baseline evaluation was performed on 5994 elderly men 65 years of age or older taken from six clinical institutions.Between December 2003 and March 2005, a total of 3135 of these participants were subjected to complete unattended polysomnography and 3 to 5-day actigraphy tests as part of the sleep study.The purpose of the Sleep Study was to determine the extent to which sleep disorders are linked to adverse health outcomes, such as the increased risk of death, fractures, falls, and cardiovascular disease.2. The MESA Sleep Study: The Multi-Ethnic Study of Atherosclerosis (MESA) is a collaborative 6-center longitudinal investigation of factors associated with the development and progression of subclinical to clinical cardiovascular disease in 6814 black, white, Hispanic, and Chinese American men and women aged 45 to 84 years in 2000-2002.The participants were all between the ages of 45 and 84 at the time of the study.Four follow-up examinations have been performed, one each in the years 2003-2004, 2004-2005, 2005-2007, and 2010-2011.Furthermore, 2237 people participated in a Sleep Exam conducted by

Fig. 4 |
Fig.4| Automatic labeling of AHI events for a random patient using two sensors (abdominal movement and SpO 2 ).The blue areas represent true events (zero indicates no event and one indicates an AHI event).The output of DRIVEN is illustrated with symbols that represent, for each 30 s window, the probability of the window to be classified as an AHI event.The windows are colored according to their classification, depending on whether they are above or below the determined Dataset partition into train, validation, and test We performed three different trials with the complete dataset: (1) SHHS and MROS patients as training and validation sets in an 80-20% random split and MESA patients as test set, (2) SHHS patients as training and validation sets in an 80-20% random split and MESA and MROS as test set, and (3) SHHS, MROS and MESA patients as training, validation and test sets in a 60-20-20% random split.Because there are patients in SHHS and MROS databases that have two recordings, to avoid data leakage we were careful to always place them in the same partition.The performance metrics for validation and test sets of each trial can be found in Supplementary Table

Table 1 |
Data description

Table 2 |
F1 scores for different combinations of sensors includes the ground truth labels divided by apnea and different hypopnea types.MESA Sleep between 2010 and 2012.This exam included an unattended overnight polysomnogram, wrist-worn actigraphy for seven days, and a sleep questionnaire.The sleep study aims to determine whether there is a correlation between subclinical atherosclerosis and sex, ethnicity, or other demographic differences in sleep and sleep disorders.3. The SHHS Sleep Study: The Sleep Heart Health Study is a multicenter cohort study conducted by the National Institute of Heart, Lung, and Blood.The purpose of the study was to investigate the effects of sleepdisordered breathing on the cardiovascular system and also on other aspects of a person's health and to determine whether breathing problems that occur during sleep are related to an increased risk of coronary heart disease, stroke, death from all causes, and hypertension.Between November 1995, and January 1998, SHHS Visit 1 research focused on participants who were at least 40 years old and included 6441 men and women.A second polysomnogram, known as SHHS Visit 2, was performed on 3295 participants during the third exam cycle (January 2001 to June 2003).